Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
The widespread adoption of self-attention (i.e. the Transformer model) and BERT-like training principles has recently produced a number of high-performing models on a wide panoply of vision-and-language problems (such as Visual Question Answering (VQA), image retrieval, etc.). In this paper we claim that these State-Of-The-Art (SOTA) approaches perform reasonably well at structuring information inside a single modality but, despite their impressive performance, tend to struggle to identify fine-grained inter-modality relationships. Indeed, such relations are frequently assumed to be implicitly learned during training from application-specific losses, mostly cross-entropy for classification. While most recent works provide inductive bias for inter-modality relationships via cross-attention modules, in this work we demonstrate (1) that the latter assumption does not hold, i.e. modality alignment does not necessarily emerge automatically, and (2) that adding weak supervision for alignment between visual objects and words improves the quality of the learned models on tasks requiring reasoning. In particular, we integrate an object-word alignment loss into SOTA vision-language reasoning models and evaluate it on two tasks: VQA and Language-driven Comparison of Images. We show that the proposed fine-grained inter-modality supervision significantly improves performance on both tasks. In particular, this new learning signal allows obtaining SOTA-level performance on the GQA dataset (VQA task) with pre-trained models without fine-tuning on the task, and a new SOTA on the NLVR2 dataset (Language-driven Comparison of Images). Finally, we illustrate the impact of this contribution on the models' reasoning by visualizing attention distributions.
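To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of an object-word alignment loss added on top of a task loss. The tensor shapes, the cosine-similarity scoring, and the weak `align_targets` labels are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(word_feats, obj_feats, align_targets):
    """word_feats: (B, T, D) word embeddings from the language stream.
    obj_feats: (B, K, D) region embeddings from the visual stream.
    align_targets: (B, T, K) weak 0/1 float labels for word-object pairs.
    """
    # Cosine-similarity alignment score between every word and every object.
    w = F.normalize(word_feats, dim=-1)
    o = F.normalize(obj_feats, dim=-1)
    scores = torch.einsum("btd,bkd->btk", w, o)
    # Binary cross-entropy against the weak alignment labels.
    return F.binary_cross_entropy_with_logits(scores, align_targets)

# Hypothetical usage: combine with the task loss, weighted by lambda_align.
# total_loss = task_loss + lambda_align * alignment_loss(w, o, targets)
```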
Estimating semantic structure for the VQA answer space
Since its appearance, Visual Question Answering (VQA, i.e. answering a question posed over an image) has always been treated as a classification problem over a set of predefined answers. Despite its convenience, this classification approach poorly reflects the semantics of the problem: it limits answering to a choice among independent proposals, without taking into account the similarity between them (e.g. penalizing the answers cat and German shepherd equally when the ground truth is dog). We address this issue by proposing (1) two measures of proximity between VQA classes, and (2) a corresponding loss which takes the estimated proximity into account. This significantly improves the generalization of VQA models by reducing their language bias. In particular, we show that our approach is completely model-agnostic, yielding consistent improvements with three different VQA models. Finally, by combining our method with a language bias reduction approach, we report SOTA-level performance on the challenging VQAv2-CP dataset.
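As a sketch of how such a proximity-aware loss might look, the snippet below softens the usual one-hot cross-entropy target with a class-proximity matrix, so near-miss answers are penalized less. The matrix `sim` and its construction are assumptions for illustration; the paper proposes two specific proximity measures that may differ.

```python
import torch
import torch.nn.functional as F

def proximity_loss(logits, target, sim):
    """logits: (B, C) answer scores; target: (B,) class indices;
    sim: (C, C) class-proximity matrix with sim[i, i] == 1.
    """
    soft_target = sim[target]  # (B, C): one proximity row per sample
    soft_target = soft_target / soft_target.sum(-1, keepdim=True)
    # Cross-entropy against the softened (proximity-weighted) target.
    return -(soft_target * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# Toy example with C = 4 answers, where class 0 ("dog") is close to
# class 1 ("German shepherd"); the similarity values are made up.
sim = torch.tensor([[1.0, 0.8, 0.1, 0.1],
                    [0.8, 1.0, 0.1, 0.1],
                    [0.1, 0.1, 1.0, 0.2],
                    [0.1, 0.1, 0.2, 1.0]])
loss = proximity_loss(torch.randn(2, 4), torch.tensor([0, 2]), sim)
```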
How Transferable are Reasoning Patterns in VQA?
Since its inception, Visual Question Answering (VQA) has been notorious as a task where models are prone to exploiting dataset biases to find shortcuts instead of performing high-level reasoning. Classical methods address this by removing biases from the training data or by adding branches to models to detect and remove biases. In this paper, we argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision-and-language problems. We train a visual oracle and, in a large-scale study, provide experimental evidence that it is much less prone to exploiting spurious dataset biases than standard models. We study the attention mechanisms at work in the visual oracle and compare them with those of a SOTA Transformer-based model. We provide an in-depth analysis and visualizations of the reasoning patterns obtained with an online visualization tool, which we make publicly available (https://reasoningpatterns.github.io). We exploit these insights by transferring reasoning patterns from the oracle, via fine-tuning, to a SOTA Transformer-based VQA model that takes standard noisy visual inputs. In experiments we report higher overall accuracy, as well as higher accuracy on infrequent answers for each question type, which provides evidence for improved generalization and a decreased dependency on dataset biases.
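A minimal, self-contained sketch of the two-phase transfer recipe described above, under the assumption that transfer is done by pre-training on oracle (clean) visual inputs and then fine-tuning the same weights on noisy detector features with a smaller learning rate. The toy classifier, feature dimensions, learning rates, and random stand-in data are placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

# Toy stand-in for a VQA answer classifier over 2048-d visual features.
model = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 3000))

def run_phase(model, batches, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for feats, answers in batches:
        opt.zero_grad()
        loss_fn(model(feats), answers).backward()
        opt.step()

# Phase 1: learn reasoning patterns on oracle inputs (e.g. features derived
# from ground-truth object annotations); random tensors stand in for data.
oracle_batches = [(torch.randn(8, 2048), torch.randint(0, 3000, (8,)))]
run_phase(model, oracle_batches, lr=1e-4)

# Phase 2: transfer by fine-tuning on standard noisy detector features,
# with a smaller learning rate to preserve the learned patterns.
noisy_batches = [(torch.randn(8, 2048), torch.randint(0, 3000, (8,)))]
run_phase(model, noisy_batches, lr=1e-5)
```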